date()
## [1] "Mon Nov 28 14:33:56 2022"
First we install and load the R packages needed to complete the assignment.
#install.packages('ggplot2')
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(vtable)
## Warning: package 'vtable' was built under R version 4.2.2
## Loading required package: kableExtra
## Warning: package 'kableExtra' was built under R version 4.2.2
library(MASS)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.92 loaded
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Create a new R Markdown file chapter4.Rmd and include it as a child file in your ‘index.Rmd’ file.
Load Boston data from MASS package. Explore structure, dimensions and describe the dataset briefly. (0-1 points)
library(MASS)
data("Boston") #download the Boston data
dim(Boston) #dimensions
## [1] 506 14
str(Boston) #structure
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Interpretation of the results
The Boston dataset has 506 rows (observations/data points) and 14 columns (variables).
All the variables are stored as numeric, but some may be on an ordinal scale rather than an interval or ratio scale. A more detailed description of the dataset can be found here
gg = ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size=2)), title="Graphical overview of Boston data")
gg1 = gg + theme(axis.text = element_text(size=5), strip.text.x = element_text(size = 5), strip.text.y = element_text(size=5)) #change the size of the text
gg1
Interpretation of the figure
There are quite many plots on display, so the figure is a bit hard to read:
Only a few variables, such as rm (average number of rooms per dwelling) and medv (median value of owner-occupied homes in $1000s), seem to be normally distributed. All the other variables, except chas (Charles River dummy variable), are more or less continuous. There seem to be several linear relationships between variables, but many non-linear relationships can also be detected. For example, based on the distributions and scatter plots, crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and black (1000(Bk−0.63)^2) appear closer to an exponential than a normal distribution. Many of the correlations between the variables are weak, yet significant. However, if variables are not normally distributed, Pearson's correlation is not the most informative way to explore the dataset.
Another way to illustrate the relationships between the variables is the corrplot() function from the corrplot package.
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle")
Interpretation of the plot
The strength of each correlation is illustrated by the size and transparency of the circles:
As you can see in the picture, most relationships are either non-significant or at least weak. This is partly due to the non-linear properties of many of the variables in the Boston dataset.
st(Boston) # the command prints a summary statistics table to Viewer-window
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| crim | 506 | 3.614 | 8.602 | 0.006 | 0.082 | 3.677 | 88.976 |
| zn | 506 | 11.364 | 23.322 | 0 | 0 | 12.5 | 100 |
| indus | 506 | 11.137 | 6.86 | 0.46 | 5.19 | 18.1 | 27.74 |
| chas | 506 | 0.069 | 0.254 | 0 | 0 | 0 | 1 |
| nox | 506 | 0.555 | 0.116 | 0.385 | 0.449 | 0.624 | 0.871 |
| rm | 506 | 6.285 | 0.703 | 3.561 | 5.886 | 6.624 | 8.78 |
| age | 506 | 68.575 | 28.149 | 2.9 | 45.025 | 94.075 | 100 |
| dis | 506 | 3.795 | 2.106 | 1.13 | 2.1 | 5.188 | 12.126 |
| rad | 506 | 9.549 | 8.707 | 1 | 4 | 24 | 24 |
| tax | 506 | 408.237 | 168.537 | 187 | 279 | 666 | 711 |
| ptratio | 506 | 18.456 | 2.165 | 12.6 | 17.4 | 20.2 | 22 |
| black | 506 | 356.674 | 91.295 | 0.32 | 375.378 | 396.225 | 396.9 |
| lstat | 506 | 12.653 | 7.141 | 1.73 | 6.95 | 16.955 | 37.97 |
| medv | 506 | 22.533 | 9.197 | 5 | 17.025 | 25 | 50 |
Interpretation of the results
The summary table displays the summary statistics (n, mean, standard deviation, min, max, and the 25th and 75th percentiles). There are no missing values in the dataset. The summary statistics can also hint at the distribution of each variable.
For example, we know that crim has a roughly exponential distribution, and the data supports this: the mean is far larger than the 75th percentile would suggest, and the maximum is extreme.
For rm, by contrast, the 25th and 75th percentiles sit fairly symmetrically between the minimum and maximum, with the mean roughly in the middle, which is consistent with a normal distribution.
One way to put the variables on a common footing is to standardize the data. Note that standardization rescales each variable to mean 0 and standard deviation 1, but it does not change the shape of its distribution.
boston_scaled_KS <- as.data.frame(scale(Boston))
st(boston_scaled_KS)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| crim | 506 | 0 | 1 | -0.419 | -0.411 | 0.007 | 9.924 |
| zn | 506 | 0 | 1 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | 0.268 | 2.987 |
When we standardize variables, we put them all on the same scale, which makes scores comparable across different types of variables. Standardizing sets the mean to 0 and the standard deviation to 1, as can be seen in the summary statistics. Negative values are below the mean and positive values above it.
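As a quick sketch of what scale() does (toy numbers below, not from the Boston data), each column is centered by its mean and divided by its standard deviation:

```r
# scale() standardizes a column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)            # manual standardization
all.equal(as.numeric(scale(x)), z)    # TRUE: scale() computes the same values
c(mean(z), sd(z))                     # mean 0, sd 1
```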
# Create categorical variable
boston_scaled_KS$crim <- as.numeric(boston_scaled_KS$crim) #quantiles are Pctl.25 = -0.411 and Pctl.75 = 0.007
bins <- quantile(boston_scaled_KS$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%)
bins # save the quantiles as a vector, so you can use those as cut-off points
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(boston_scaled_KS$crim, breaks = bins, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE) #categorical variable of crime
summary(crime) #you can see that variable now has 4 even categories (1=0-25%, 2=25%-50%, 3=50%-75%, 4=75%-100%)
## low med_low med_high high
## 127 126 126 127
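The quantile-then-cut pattern above can be sketched on toy data (the values below are illustrative, not from the Boston data):

```r
# Split a numeric vector into four (nearly) equal-sized groups at its quartiles
x <- 1:8
b <- quantile(x)                       # 0%, 25%, 50%, 75%, 100% cut-offs
g <- cut(x, breaks = b,
         labels = c("low", "med_low", "med_high", "high"),
         include.lowest = TRUE)        # keep the minimum in the first bin
table(g)                               # four groups of 2 observations each
```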
# Drop crim and add crime
boston_scaled_KS <- dplyr::select(boston_scaled_KS, -crim) #discard the old crim variable using -crim in the scaled Boston data
glimpse(boston_scaled_KS) #sanity-check that crim has been deleted
## Rows: 506
## Columns: 13
## $ zn <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, -0.48724019…
## $ indus <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1.3055857, -1…
## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0.8344581, -0…
## $ rm <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.2273620, 0.20689…
## $ age <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, -0.51067434…
## $ dis <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711351, 1.0766…
## $ rad <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0.7521778, -0…
## $ tax <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1.1050216, -1…
## $ ptratio <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.1129203, 0.11…
## $ black <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.4406159, 0.41016…
## $ lstat <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078, -1.0254866…
## $ medv <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1.48603229, 0…
boston_scaled_KS <- data.frame(boston_scaled_KS, crime) #add the new crime categorical variable
glimpse(boston_scaled_KS) #sanity-check that crime exists in the dataset
## Rows: 506
## Columns: 14
## $ zn <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, -0.48724019…
## $ indus <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1.3055857, -1…
## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0.8344581, -0…
## $ rm <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.2273620, 0.20689…
## $ age <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, -0.51067434…
## $ dis <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711351, 1.0766…
## $ rad <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0.7521778, -0…
## $ tax <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1.1050216, -1…
## $ ptratio <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.1129203, 0.11…
## $ black <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.4406159, 0.41016…
## $ lstat <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078, -1.0254866…
## $ medv <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1.48603229, 0…
## $ crime <fct> low, low, low, low, low, low, med_low, med_low, med_low, med_l…
EDITED CODE: in the Pre-exercise-code of Exercise 4.5, directly AFTER the line that starts with boston_scaled <- read.table and ends with sep=",", header = T), you should ADD ONE LINE:
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
# Divide dataset into test (20%) and train (80%)
boston_scaled_KS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt", sep=",", header = T)
boston_scaled_KS$crime <- factor(boston_scaled_KS$crime, levels = c("low", "med_low", "med_high", "high"))
ind <- sample(nrow(boston_scaled_KS), size = nrow(boston_scaled_KS) * 0.8) #create a random index covering 80% of the rows
train_4 <- boston_scaled_KS[ind,] #80% of the dataset goes to train dataset
test_4 <- boston_scaled_KS[-ind,] #the remaining 20% (-ind) goes to test dataset
correct_classes_4 <- test_4$crime #save the correct classes from test data
test_4 <- dplyr::select(test_4, -crime) # remove the crime variable from test data
glimpse(test_4) #sanity-check: no crime, rows = 102, columns 13
## Rows: 102
## Columns: 13
## $ zn <dbl> -0.48724019, -0.48724019, 0.04872402, 0.04872402, -0.48724019,…
## $ indus <dbl> -0.5927944, -1.3055857, -0.4761823, -0.4761823, -0.4368257, -0…
## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox <dbl> -0.7395304, -0.8344581, -0.2648919, -0.2648919, -0.1440749, -0…
## $ rm <dbl> 1.28144555, 0.20689164, -0.38802695, -0.93028528, -0.47769171,…
## $ age <dbl> -0.26554897, -0.35080997, -0.07015919, 1.11638970, -0.24068118…
## $ dis <dbl> 0.556609050, 1.076671135, 0.838414219, 1.086121629, 0.43332522…
## $ rad <dbl> -0.8670245, -0.7521778, -0.5224844, -0.5224844, -0.6373311, -0…
## $ tax <dbl> -0.9863534, -1.1050216, -0.5769480, -0.5769480, -0.6006817, -0…
## $ ptratio <dbl> -0.30279450, 0.11292035, -1.50374851, -1.50374851, 1.17530274,…
## $ black <dbl> 0.39603507, 0.41016511, 0.42637632, 0.32812326, 0.44061589, 0.…
## $ lstat <dbl> -1.20753241, -1.04229087, -0.03123671, 2.41937935, -0.61518350…
## $ medv <dbl> 1.32293748, 0.67055821, 0.03992492, -0.65594629, -0.23189977, …
glimpse(train_4) #sanity-check: crime, rows = 404, columns 14
## Rows: 404
## Columns: 14
## $ zn <dbl> -0.4872402, -0.4872402, -0.4872402, -0.4872402, -0.4872402, -0…
## $ indus <dbl> 1.0149946, 1.2307270, -0.1642450, -0.1642450, 1.2307270, 2.115…
## $ chas <dbl> -0.2723291, 3.6647712, -0.2723291, -0.2723291, -0.2723291, -0.…
## $ nox <dbl> 0.25289548, 0.43412107, -0.06640675, -0.06640675, 0.43412107, …
## $ rm <dbl> -0.40083620, 2.97511331, -0.52892872, -0.27416693, -0.57589598…
## $ age <dbl> 0.92099991, 0.89968466, 0.86415924, 0.95297278, 1.02047107, 0.…
## $ dis <dbl> -0.595876266, -0.775530624, -0.684634922, -0.592219542, -0.667…
## $ rad <dbl> 1.6596029, -0.5224844, -0.4076377, -0.4076377, -0.5224844, -0.…
## $ tax <dbl> 1.52941294, -0.03107419, 0.14099473, 0.14099473, -0.03107419, …
## $ ptratio <dbl> 0.8057784, -1.7347012, -0.3027945, -0.3027945, -1.7347012, 0.2…
## $ black <dbl> -0.27804446, 0.34805866, 0.41925653, 0.44061589, -0.09358721, …
## $ lstat <dbl> 1.21367625, -1.30695741, 0.49809636, 0.62132734, -0.08725079, …
## $ medv <dbl> -0.37324861, 2.98650460, -0.40586757, -0.41674056, -0.37324861…
## $ crime <fct> high, med_high, med_low, med_low, med_high, low, high, med_low…
About the method: LDA is often used in statistics to find separate groups or clusters (at least two) based on different characteristics of the data. LDA has continuous independent variable(s) and a categorical dependent variable. Where factor analysis creates factors based on similarities in the data, LDA separates groups based on their differences and works with predefined (in)dependent variables, whereas in factor analysis (especially exploratory factor analysis) the groups are often data-driven. Also, LDA is used when the groups have already been defined (in contrast to cluster analysis). See more details in Linear discriminant analysis (LDA).
Here, the groups are already defined: low, med_low, med_high, high. We want to see how these groups behave with respect to the other variables in the data.
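As a minimal, self-contained illustration of the same setup (using the built-in iris data rather than Boston), lda() takes a categorical outcome and continuous predictors:

```r
library(MASS)

# LDA sketch: Species is the categorical outcome, the four
# flower measurements are the continuous predictors
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)$class
mean(pred == iris$Species)  # in-sample accuracy, about 0.98
```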
lda.fit_train <- lda(crime ~ ., data = train_4) # linear discriminant analysis (LDA), "." indicates all other variables
lda.fit_train
## Call:
## lda(crime ~ ., data = train_4)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2549505 0.2475248 0.2599010
##
## Group means:
## zn indus chas nox rm age
## low 1.07040579 -0.9518017 -0.06727176 -0.9000804 0.48503328 -0.9016759
## med_low -0.04598034 -0.3372243 -0.08120770 -0.6020486 -0.09142437 -0.4342430
## med_high -0.36546912 0.1893811 0.31823597 0.3657731 0.20774559 0.4376766
## high -0.48724019 1.0170492 -0.04735191 1.0422336 -0.43271701 0.8168935
## dis rad tax ptratio black lstat
## low 0.9386941 -0.6815949 -0.7262722 -0.46013102 0.3747316 -0.77848497
## med_low 0.4127049 -0.5358646 -0.4730269 -0.09785201 0.3149397 -0.22197009
## med_high -0.3885795 -0.4283101 -0.3220487 -0.31341832 0.1461196 -0.06736579
## high -0.8633897 1.6388211 1.5145512 0.78158339 -0.7479170 0.88715417
## medv
## low 0.56409354
## med_low 0.04393632
## med_high 0.24237996
## high -0.67572477
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11442689 0.68184458 -1.14766565
## indus 0.10497753 -0.30213247 0.20948718
## chas -0.10746203 -0.07191690 -0.02918765
## nox 0.38699548 -0.79233220 -1.21779106
## rm -0.13097186 -0.14999541 -0.20663793
## age 0.17653794 -0.37808265 -0.28733483
## dis -0.07238127 -0.28951234 0.20365790
## rad 3.38648550 0.90966797 -0.10521034
## tax -0.03691444 0.05512577 0.62191235
## ptratio 0.10721692 0.03263558 -0.34643007
## black -0.12699946 0.03741652 0.12316562
## lstat 0.22636243 -0.06071022 0.34163569
## medv 0.23039780 -0.25545986 -0.11245266
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9490 0.0377 0.0133
Interpretation of the results
The interpretation of the results is adapted from Linear Discriminant Analysis in R - finnstats and Discriminant Analysis Essentials in R.
The plot() function produces a biplot of the linear discriminants, obtained by computing LD1 and LD2 for each of the training observations.
lda.arrows_train <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes_4 <- as.numeric(train_4$crime) # target classes as numeric; 4 classes: low, med_low, med_high, high
# plot the lda results
plot(lda.fit_train, dimen = 2, col = classes_4, pch = classes_4)
lda.arrows_train(lda.fit_train, myscale = 1)
Interpretation of the plot
Consistent with the coefficients of linear discriminants in the previous table, rad (index of accessibility to radial highways) seems to have the most impact in differentiating the clusters. Based on the plot, the coefficients, and the group means, rad differs clearly between the high crime rate category and the other categories. In other words, access to radial highways appears to be most strongly associated with a high crime rate.
The first two steps have already been done. The test dataset does not contain the crime variable.
# The first two steps have already been done. The test dataset does not contain "crime" variable.
glimpse(test_4) #sanity check, no crime
## Rows: 102
## Columns: 13
## $ zn <dbl> -0.48724019, -0.48724019, 0.04872402, 0.04872402, -0.48724019,…
## $ indus <dbl> -0.5927944, -1.3055857, -0.4761823, -0.4761823, -0.4368257, -0…
## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox <dbl> -0.7395304, -0.8344581, -0.2648919, -0.2648919, -0.1440749, -0…
## $ rm <dbl> 1.28144555, 0.20689164, -0.38802695, -0.93028528, -0.47769171,…
## $ age <dbl> -0.26554897, -0.35080997, -0.07015919, 1.11638970, -0.24068118…
## $ dis <dbl> 0.556609050, 1.076671135, 0.838414219, 1.086121629, 0.43332522…
## $ rad <dbl> -0.8670245, -0.7521778, -0.5224844, -0.5224844, -0.6373311, -0…
## $ tax <dbl> -0.9863534, -1.1050216, -0.5769480, -0.5769480, -0.6006817, -0…
## $ ptratio <dbl> -0.30279450, 0.11292035, -1.50374851, -1.50374851, 1.17530274,…
## $ black <dbl> 0.39603507, 0.41016511, 0.42637632, 0.32812326, 0.44061589, 0.…
## $ lstat <dbl> -1.20753241, -1.04229087, -0.03123671, 2.41937935, -0.61518350…
## $ medv <dbl> 1.32293748, 0.67055821, 0.03992492, -0.65594629, -0.23189977, …
Next, we use the trained model (lda.fit_train) to predict the classes for the test data (test_4).
lda.pred_test <- predict(lda.fit_train, newdata = test_4) # predict classes with test data
table(correct = correct_classes_4, predicted = lda.pred_test$class) # cross tabulate the results
## predicted
## correct low med_low med_high high
## low 13 16 2 0
## med_low 1 16 6 0
## med_high 0 10 14 2
## high 0 0 0 22
Interpretation of the results
Overall, using the test data we predict 65 of the 102 values correctly (about 64%): the diagonal of the cross-tabulation gives 13 + 16 + 14 + 22 = 65.
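The accuracy comes from the diagonal of the cross-tabulation; a sketch of the arithmetic, with the matrix entries copied from the table above (rows = correct, columns = predicted):

```r
# Accuracy = correctly predicted classes (diagonal) / all predictions
tab <- matrix(c(13, 16,  2,  0,
                 1, 16,  6,  0,
                 0, 10, 14,  2,
                 0,  0,  0, 22),
              nrow = 4, byrow = TRUE)
sum(diag(tab))             # 65 correct classifications
sum(diag(tab)) / sum(tab)  # ~0.64 overall accuracy
```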
Reload the Boston dataset
data("Boston")
boston_scl <- as.data.frame(scale(Boston))
st(boston_scl)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| crim | 506 | 0 | 1 | -0.419 | -0.411 | 0.007 | 9.924 |
| zn | 506 | 0 | 1 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | 0.268 | 2.987 |
# Create categorical variable
boston_scl$crim <- as.numeric(boston_scl$crim)
bins_scl <- quantile(boston_scl$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%)
bins_scl # save the quantiles as a vector, so you can use those as cut-off points
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime_scl <- cut(boston_scl$crim, breaks = bins_scl, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
summary(crime_scl) #you can see that variable now has 4 even categories (low=127, med_low=126, med_high=126, high=127)
## low med_low med_high high
## 127 126 126 127
# Drop crim and add crime
boston_scl <- dplyr::select(boston_scl, -crim) #discard the old crim variable using -crim in the scaled Boston data
boston_scl <- data.frame(boston_scl, crime_scl) #add the new crime categorical variable
st(boston_scl) #sanity-check
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| zn | 506 | 0 | 1 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | 0.268 | 2.987 |
| crime_scl | 506 | ||||||
| … low | 127 | 25.1% | |||||
| … med_low | 126 | 24.9% | |||||
| … med_high | 126 | 24.9% | |||||
| … high | 127 | 25.1% |
boston_scl$crime_scl <- factor(boston_scl$crime_scl, levels = c("low", "med_low", "med_high", "high"))
Compute distances: the Euclidean distance matrix is the default; to use the Manhattan distance, specify method = "manhattan".
# Distances between the observations
# ?dist
dist_eu_scl <- dist(boston_scl)
## Warning in dist(boston_scl): NAs introduced by coercion
summary(dist_eu_scl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5267 4.9081 4.9394 6.2421 13.0045
dist_man_scl <- dist(boston_scl, method="manhattan")
## Warning in dist(boston_scl, method = "manhattan"): NAs introduced by coercion
summary(dist_man_scl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2849 8.8386 13.0300 13.8657 18.0953 46.8948
NOTE. I got the warning NAs introduced by coercion when crime_scl was a factor: dist() needs a numeric matrix, and coercing the factor labels to numbers produces NAs. With crime_scl as numeric, no warning occurred.
Next, we treat crime_scl as numeric rather than as a factor.
# Distances between the observations
boston_scl$crime_scl <- as.numeric(boston_scl$crime_scl)
dist_eu_scl <- dist(boston_scl)
summary(dist_eu_scl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.5760 4.9401 4.9913 6.3033 12.8856
dist_man_scl <- dist(boston_scl, method="manhattan")
summary(dist_man_scl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2645 8.9648 13.2765 14.1297 18.5263 46.5452
Interpretation of the results
The first summary is for the Euclidean distances and the second for the Manhattan distances. A distance measures the dissimilarity between vectors: the dist() command calculates the distances between the rows of a matrix (here boston_scl). Larger values indicate greater differences between the rows, and therefore possible distinct patterns between groups.
I have not computed distance matrices before, so I am not very familiar with this approach. I found this page quite useful for explaining it. If I understood correctly, a distance is an arbitrary non-negative number, meaning it has no fixed scale (unlike a correlation, which ranges from -1 to 1). If the distance is zero, the points are identical; high values indicate little similarity.
I also found this page helpful for explaining the basics.
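A two-point toy example (not from the Boston data) makes the difference between the two metrics concrete:

```r
# Two points, (0,0) and (3,4)
m <- rbind(c(0, 0), c(3, 4))
dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # Manhattan: |3 - 0| + |4 - 0| = 7
```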
K-means algorithm NOTE. I used the standardized dataset (boston_scl), not the raw dataset as in the exercise.
# Investigate what is the optimal number of clusters and run the algorithm again
set.seed(123) #Set the starting point for the random initialization of kmeans(). Setting the seed ensures we get the same results each time; otherwise each run would start from a different random initialization.
k_max_scl <- 10 #define max amount of clusters
twcss_scl <- sapply(1:k_max_scl, function(k){kmeans(boston_scl, k)$tot.withinss}) # calculate the total within sum of square
qplot(x = 1:k_max_scl, y = twcss_scl, geom = 'line', xlim = c(1,10)) # visualize the results
Interpretation of the plot The optimal number of clusters is where the total WCSS drops sharply and then levels off (the so-called elbow). Based on the plot, the optimal number of clusters would be 2-4.
Next, we generate and plot these models and then choose which one is the most suitable.
Choosing the best model & number of clusters
km_2 <- kmeans(boston_scl, centers = 2) # 2 indicates the number of clusters we are generating
km_3 <- kmeans(boston_scl, centers = 3)
km_4 <- kmeans(boston_scl, centers = 4)
# plot the Boston dataset with clusters, first using all variables
pairs(boston_scl, col = km_2$cluster) #2 clusters: red and black
pairs(boston_scl, col = km_3$cluster) #3 clusters: red, black and green
pairs(boston_scl, col = km_4$cluster) #4 clusters: red, black, green and blue
Interpretation of the plots
We try to find the number of clusters with minimal overlap, and to detect variables that show the most distinct clusters.
Next, we can look at the same plots, focusing only on columns 5-10 (rm, age, dis, rad, tax, ptratio):
pairs(boston_scl[5:10], col = km_2$cluster) #2 clusters: red and black
pairs(boston_scl[5:10], col = km_3$cluster) #3 clusters: red, black and green
pairs(boston_scl[5:10], col = km_4$cluster) #4 clusters: red, black, green and blue
Interpretation of the plot
Therefore, a model with 2-3 clusters may be the most suitable to describe the data.
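One extra check worth doing is comparing cluster sizes with table(); sketched here on simulated data, since the Boston cluster sizes depend on the random seed:

```r
set.seed(123)
# Two well-separated simulated groups of 20 observations each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2)
table(km$cluster)  # cluster sizes; separation this clean splits 20/20
```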
Read and standardize
data("Boston")
boston_bonus <- as.data.frame(scale(Boston))
st(boston_bonus) #sanity check
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| crim | 506 | 0 | 1 | -0.419 | -0.411 | 0.007 | 9.924 |
| zn | 506 | 0 | 1 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | 0.268 | 2.987 |
# Create categorical variable
boston_bonus$crim <- as.numeric(boston_bonus$crim)
bins_b <- quantile(boston_bonus$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%)
bins_b # save the quantiles as a vector, so you can use those as cut-off points
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# 0% 25% 50% 75% 100%
# -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
st(boston_bonus)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| crim | 506 | 0 | 1 | -0.419 | -0.411 | 0.007 | 9.924 |
| zn | 506 | 0 | 1 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | 0.268 | 2.987 |
crime_b <- cut(boston_bonus$crim, breaks = bins_b, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
summary(crime_b) #you can see that variable now has 4 even categories (low=127, med_low=126, med_high=126, high=127)
## low med_low med_high high
## 127 126 126 127
boston_bonus <- dplyr::select(boston_bonus, -crim) #discard the old crim variable using -crim in the scaled Boston data
boston_bonus <- data.frame(boston_bonus, crime_b) #add the new crime categorical variable
boston_bonus$crime_b <- factor(boston_bonus$crime_b, levels = c("low", "med_low", "med_high", "high"))
st(boston_bonus)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| zn | 506 | 0 | 1 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | 0.268 | 2.987 |
| crime_b | 506 | ||||||
| … low | 127 | 25.1% | |||||
| … med_low | 126 | 24.9% | |||||
| … med_high | 126 | 24.9% | |||||
| … high | 127 | 25.1% |
K-means
#k-means
boston_bonus$crime_b <- as.numeric(boston_bonus$crime_b)
set.seed(123)
k_max <- 10
twcss_b <- sapply(1:k_max, function(k){kmeans(boston_bonus, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss_b, geom = 'line', xlim = c(1,10))
#based on the plot 2-3 clusters are the most optimal
Interpretation of the plot
A model with 2-3 clusters seems to describe our data best.
bkm_2 <- kmeans(boston_bonus, centers = 2)
pairs(boston_bonus, col = bkm_2$cluster)
bkm_3 <- kmeans(boston_bonus, centers = 3)
pairs(boston_bonus, col = bkm_3$cluster)
#based on the plots, the model with 2 clusters seems more optimal; the third cluster gives very similar results to the other clusters
Interpretation of the plot
Therefore, a model with 2 clusters may be the most suitable to describe our data. However, the guidelines require more than 2 clusters (Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2)), so we run the following model with 3 clusters.
Run LDA
For some reason, when I tried to knit Assignment 4, this no longer worked and instead produced the error `Error in x - group.means[g, ] : non-conformable arrays`. I am unsure how to fix it. Below is the code, followed by a screenshot of the model output from when it did work.
boston_bonus <- data.frame(boston_bonus, crime_b) #add the new crime categorical variable
boston_bonus$crime_b <- factor(boston_bonus$crime_b, levels = c("low", "med_low", "med_high", "high"))
library(MASS)
boston_bonus$crime_b <- as.numeric(boston_bonus$crime_b)
lda.fit_b <- lda(bkm_3$cluster ~ ., data = boston_bonus) #LDA using the k-means clusters as target classes
lda.fit_b
Picture 1
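One plausible cause of the non-conformable-arrays error is that the target vector (`bkm_3$cluster`) lives outside the data frame passed to `lda()`. A sketch that binds the cluster labels into the data frame first, so the formula refers to a single data frame (names are assumptions based on the code above):

```r
library(MASS)
# bind the k-means cluster labels into the data as a factor target
boston_clustered <- data.frame(boston_bonus, cluster = factor(bkm_3$cluster))
lda.fit_b <- lda(cluster ~ ., data = boston_clustered)
```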
Interpretation of the results
lda.arrows_b <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
classes_b <- as.numeric(bkm_3$cluster) # target classes as numeric
# plot the lda results
plot(lda.fit_b, dimen = 2, col = classes_b, pch = classes_b)
lda.arrows_b(lda.fit_b, myscale = 1)
Picture 2
Interpretation of the plot
zn (proportion of residential land zoned for lots over 25,000 sq.ft.) seems to be the most influential linear separator between the clusters, followed by age, crime, tax and nox.
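A claim like this can be checked directly from the discriminant coefficients, for example by sorting the absolute LD1 loadings (a sketch, assuming `lda.fit_b` is the fitted model from above):

```r
# variables with the largest absolute weight on the first linear discriminant
ld1 <- coef(lda.fit_b)[, 1]
sort(abs(ld1), decreasing = TRUE)
```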
model_predictors_sb <- dplyr::select(train_4, -crime) #the LD values for the first 3D plot are generated from the train_4 data (80%)
# check the dimensions
dim(model_predictors_sb)
## [1] 404 13
dim(lda.fit_train$scaling)
## [1] 13 3
# matrix multiplication
matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit_train$scaling
matrix_product_sb <- as.data.frame(matrix_product_sb)
Install and access plotly and draw 3D plot
#install.packages("plotly")
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = train_4$crime)
3D plot based on k-means
The second plot needs to be colored by the k-means clusters. However, the train_4 data (80%) was not used to calculate k-means; instead, per the assignment instructions, we needed to use the complete Boston dataset. The standardized dataset is prepared as follows:
data("Boston")
boston_sb <- as.data.frame(scale(Boston))
boston_sb$crim <- as.numeric(boston_sb$crim)
# Create the categorical crime variable from quantile bins
bins_sb <- quantile(boston_sb$crim)
crime_sb <- cut(boston_sb$crim, breaks = bins_sb, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
boston_sb <- dplyr::select(boston_sb, -crim)
boston_sb <- data.frame(boston_sb, crime_sb)
boston_sb$crime_sb <- as.numeric(boston_sb$crime_sb)
sbkm_3 <- kmeans(boston_sb, centers = 3)
boston_sb$crim <- as.numeric(boston_sb$crime_sb) # NOTE: crim was removed above, so this duplicates crime_sb as 'crim' (the identical group means below confirm this), which triggers the collinearity warning
lda.fit_sb <- lda(sbkm_3$cluster ~ ., data = boston_sb) #LDA using the clusters as target classes
## Warning in lda.default(x, grouping, ...): variables are collinear
lda.fit_sb
## Call:
## lda(sbkm_3$cluster ~ ., data = boston_sb)
##
## Prior probabilities of groups:
## 1 2 3
## 0.1897233 0.4762846 0.3339921
##
## Group means:
## zn indus chas nox rm age
## 1 1.6956974 -1.0817907 -0.067271764 -1.1173803 0.66096776 -1.31662020
## 2 -0.3337899 -0.3632658 0.021728211 -0.3404473 0.05360582 -0.03985203
## 3 -0.4872402 1.1325382 0.007228345 1.1202149 -0.45190478 0.80473300
## dis rad tax ptratio black lstat medv
## 1 1.40553832 -0.6193863 -0.6437607 -0.6631769 0.3585763 -0.9180096 0.7410061
## 2 0.03865169 -0.5710917 -0.6216579 -0.1578980 0.2878179 -0.2319868 0.1757696
## 3 -0.85353099 1.1662378 1.2521927 0.6018841 -0.6141269 0.8522943 -0.6715802
## crime_sb crim
## 1 1.281250 1.281250
## 2 2.132780 2.132780
## 3 3.715976 3.715976
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn -0.121076625 1.40439032
## indus 0.815181218 0.37919114
## chas -0.053348063 0.02575069
## nox 1.213770032 0.59654110
## rm -0.013376446 0.16375801
## age -0.007550386 -0.49308296
## dis -0.021956895 0.49261234
## rad 0.630308088 0.29865470
## tax 0.886895618 0.64969738
## ptratio 0.330976614 0.15027688
## black -0.002217404 -0.04048125
## lstat 0.260179025 0.22374889
## medv 0.149859384 0.23550344
## crime_sb -0.054851160 -0.15575154
## crim -0.054851160 -0.15575154
##
## Proportion of trace:
## LD1 LD2
## 0.8426 0.1574
model_predictors_sb <- dplyr::select(boston_sb, -crime_sb)
dim(model_predictors_sb)
## [1] 506 14
dim(lda.fit_sb$scaling)
## [1] 15 2
# NOTE: crime_sb is numeric at this point, so matching it against character levels yields NA;
# levels = 1:4 with labels = c("low", "med_low", "med_high", "high") would be needed here
boston_sb$crime_sb <- factor(boston_sb$crime_sb, levels = c("low", "med_low", "med_high", "high"))
boston_sb$crime_sb <- as.numeric(boston_sb$crime_sb)
# matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit_sb$scaling
The last line, matrix_product_sb <- as.matrix(…), gives an error:
Error in as.matrix(model_predictors_sb) %*% lda.fit_sb$scaling : non-conformable arguments
The dimensions shown above do not match: the predictor matrix has 14 columns, while the scaling matrix has 15 rows.
Install and access plotly and draw 3D plot
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = sbkm_3$cluster) # color by the k-means clusters (boston_sb has no 'clusters' column)
Interpret results
I am unsure why it is not working. Any suggestions?
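A likely explanation: `lda.fit_sb$scaling` has 15 rows because the model was fitted with both `crime_sb` and the duplicated `crim` column, while `model_predictors_sb` has only 14 columns, so the matrix product cannot conform. A sketch of a fix (names assumed from the chunks above) that drops the duplicate and the target before fitting, so the predictor columns line up with the scaling rows:

```r
library(MASS)
# keep only the 13 scaled predictors; drop the target and the duplicated crim column
boston_sb2 <- dplyr::select(boston_sb, -crim, -crime_sb)
# bind the cluster labels in as a factor target and refit
lda.fit_sb2 <- lda(cluster ~ ., data = data.frame(boston_sb2, cluster = factor(sbkm_3$cluster)))
# now 13 predictor columns match the 13 rows of the scaling matrix
matrix_product_sb <- as.data.frame(as.matrix(boston_sb2) %*% lda.fit_sb2$scaling)
```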
This code is for the next week’s data!
See create_human.R and human.csv
End of assignment 4